Schibsted YAMS

How to build and maintain a service handling thousands of requests per second with minimal dedication

Who are you?



Daniel Caballero

DevOps/SRE Engineer @ Schibsted

Part-time (DevOps) lecturer @ La Salle University

So... I work

... I (kind of) teach

... I (try to) program...

... I (would like to) rock...

... and I live

So... I value my time (a lot)

I really don't like to waste it



  • Resolving incidents
  • Reactive work
  • Repetitive work

Schibsgrñvahed..WHAT??

What is Schibsted?

And SPT?

It's about convergence through global solutions

What's behind global components / services?

You build it, you run it

Probably nothing new on the horizon for you

That means there's no ops/support/systems/devops team

{
    "format": "webp",
    "watermark": {
        "location": "north",
        "margin": "20px",
        "dimension": "20%"
    },
    "actions": [
        {
            "resize": {
                "width": 300,
                "fit": {
                    "type": "clip"
                }
            }
        }
    ],
    "quality": 90
}

Why not offline transformations?

  • Lots of (user) content. Reprocessing hurts.
  • Sites are dynamic by nature; some adapt content to the device.

This may sound familiar to you...

CDNs able to transform content on the fly:

  • As a native functionality...
  • Or through lambdas / edge computing

SaaS solutions:

Opensource solutions:

So...

Why did you invest time in that?

Why are you here?

Availability

Low latency

Low costs

High usage

Does not require high maintenance

  • (Almost) No incidents.
  • New sites do not require high onboarding efforts.

We would be able to maintain this with half an engineer

We don't (usually) like to cut people in half, so let's say one engineer

But be careful: if you stop developing a service, you kill the service

  • Stops being competitive
  • It quickly becomes legacy
  • Disconnects from current business needs

So we try to convince the company it requires, at least, the focus of two engineers.

But... on-call rotations

Ok. Let's say 3-4. And we accept an extra project.

And we are the owners of the backlog

Even if sometimes it's not so useful...

How did you achieve that?

Combination of...

  • Team
  • Product
  • Tech

Don't you see the similarity?

Team

Autonomy

Benefiting from other Sch services

Reuse of other colleagues' code/components.

Big department portfolio:

  • AWS bootstrap
  • Vulnerability scans
  • TravisCI, Artifactory

Collaboration + transparency mindset

  • Internal RFCs
  • Consumers as contributors
  • Internal open-source model (full visibility of GitHub repos)

Product

Actual need

Limited scope

  • API as the point of interaction
  • No business logic: a "dumb" service
  • Almost no functionality that is used by a single site or by no one

Tech

Everything as code

No space for "one time" actions.

  • Alerting configuration by code
  • Infrastructure updates

Good design choices

(but not perfect / or the best, for sure)

  • Immutable pattern
  • AWS + Netflix stack + Microservices
  • libvips
  • Non-blocking services
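The "non-blocking services" choice can be illustrated with a common Go pattern: accept work concurrently but bound how many CPU-heavy transformations run at once. This is an illustrative sketch, not YAMS code; the doubling "transform" stands in for a real libvips call.

```go
package main

import (
	"fmt"
	"sync"
)

// transformAll processes n inputs with at most maxWorkers running
// concurrently: callers never block each other, while a channel used
// as a semaphore keeps the heavy transformation stage bounded.
func transformAll(n, maxWorkers int) []int {
	sem := make(chan struct{}, maxWorkers) // semaphore bounding concurrency
	var wg sync.WaitGroup
	results := make([]int, n)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it
			results[i] = i * 2       // stand-in for a libvips transformation
		}(i)
	}
	wg.Wait()
	return results
}

func main() {
	fmt.Println(transformAll(8, 4)) // [0 2 4 6 8 10 12 14]
}
```

Bounding the transformation stage is the same idea that reappears later in the roadmap ("minimize parallelization in the transformation pipe"): unbounded concurrency on CPU-bound image work degrades everyone at once.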

Continuous Delivery

And the capacity to incorporate everything into the pipeline.

> Look forward, rather than investing lots of time in your rollback strategy

Small deltas. Iterative deliveries. Low risk deployments.

0-error target

Yeah, Google SRE book and error budgets...

... but it helped us understand, tune, and earn the trust of Schibsted sites, avoiding major disruptions when big sites onboarded and minimizing the chance of "unplanned / reactive" activities

We also rely on a "good enough" test suite (unit + integration + acceptance) with good coverage of all the API functionality

  • New error conditions mean new tests
  • If tests are green, almost (TM) no room for surprises
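"New error conditions mean new tests" could look like this in Go. Both the validation function and its limits are hypothetical, chosen only to show the pattern of pinning an error condition down with a regression check:

```go
package main

import (
	"errors"
	"fmt"
)

// validateQuality is a hypothetical input check of the kind that gets
// its own test the first time a bad request slips through: once the
// error condition exists, a test guards it forever.
func validateQuality(q int) error {
	if q < 1 || q > 100 {
		return errors.New("quality must be between 1 and 100")
	}
	return nil
}

func main() {
	fmt.Println(validateQuality(90))  // <nil>
	fmt.Println(validateQuality(150)) // quality must be between 1 and 100
}
```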

Observability + troubleshooting toolkit

Enables experimentation culture

And what did you do wrong?

The Refactor (TM)

Complete refactor. New platform in parallel to deliver a new version of the API

  • APIv0
  • APIv1

Microservices split

Domain-driven design... but coupling between some services

Nice solution... but

Why not docker/k8s?

  • Local tests
  • YAMS Portal/Frontend already there
  • Migration exercise

gRPC?

Why not a Service Mesh?

And Prometheus?

We may.

And it may be a good moment to consider opencensus.

Present (& not so distant) future

More elasticity to reduce costs

  • Changes in transformation rules mean massive cache eviction
    • So we are a bit overscaled...
  • Better degradation and more efficient ASG triggers
    • Reusing cache if no capacity
    • Automatic ASG parameters adjustments
    • Minimize parallelization in the transformation pipe
    • Incoming queue

Extra compression

  • Currently libjpeg-turbo
  • Good for performance, pretty decent results, but...
  • MozJPEG, API-compatible with libjpeg
  • guetzli, from Google

Bringing the service closer to the business

  • Image uploader
  • Online image editor
  • Integration with data services
    • Automatic classification
    • Nudity detector
    • Car plate pixelation
  • More regions/cloud providers deployments

  • Video transcoding...

Current transformation pipelines

More adoption?

Some major Marketplaces are not using the service, yet

Simulating dependencies failures

Hoverfly: similar in concept to the Simian Army from Netflix, but specialized in API degradations

Stress test as part of the pipeline

Before closing...

Are you going to opensource it?

  • Schibsted does support contributing to open-source projects
  • As well as releasing internal code
  • Problem: Not following a "contribute-first" approach
  • But already contributed to bimg, zuul, krakenD...

Are you going to offer this SaaS to other companies?

Latencymap

api noiser

Final reminder

Be Rx in the code...

But not in real life

Keep the heroes in your comics

Many thanks...

Sch*

And especially...

Edge colleagues

Other Qs?

dan . caba at google (dot)com

Your opinion is very important to me

  • Find my lecture on the schedule in the eventory app
  • Rate and comment my performance

Thanks for your feedback; it will help me know what to improve